In this part, we need to distinguish between supervised and unsupervised machine learning methods. Supervised methods learn from data in which each observation carries a known label or category. Unsupervised methods aim to find groups or patterns in the dataset without using labels.
Detection of a sub-population of cancer cells that drives aggressiveness.
Isolation and study of neuroblastoma cells that drive aggressive phenotypes.
Hierarchical clustering is one of the ML methods that have been used to identify the gene expression patterns of both ADRN and MES cells.
Clustering can be used to separate the genetic profiles of cancer cells that drive progression of the disease from those associated with its regression.
ADRN neuroblastoma cells are differentiated, neuron-like cells that are killed by chemotherapy.
MES neuroblastoma cells are non-differentiated cells thought to behave like stem cells.
MES cells are resistant to chemotherapy.
Super-enhancers are regions of chromatin that are kept open to facilitate the expression of proteins that maintain the differentiation state of the neuroblastoma cells.
The goal of this project was the construction of a data frame and a database with the frequency of several SARS-CoV-2 mutations in different countries.
We constructed a mutation frequency vector and a frequency table that report the percentage of samples containing specific SARS-CoV-2 mutations.
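As a minimal sketch of how such a frequency table could be derived, assume a hypothetical long-format data frame `calls` with one row per sample and SNP. The column names and toy values below are illustrative assumptions, not the project data.

```r
# Hypothetical toy input: one row per (sample, SNP) call; not the project data.
calls <- data.frame(
  sample  = rep(c("s1", "s2", "s3", "s4"), each = 2),
  country = rep(c("Italy", "Italy", "China", "China"), each = 2),
  snp     = rep(c("241 C > T", "11083 G > T"), times = 4),
  present = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Percentage of samples per country that carry each SNP:
# the mean of a logical vector is the fraction of TRUE values.
freq_table <- 100 * tapply(calls$present, list(calls$snp, calls$country), mean)
print(round(freq_table))
```

The result is a matrix with SNPs in the rows and countries in the columns, which is the same layout as the table read from file in this section.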
heatmap_table <- read.table("./data/temp_file.txt", row.names = 1, header = TRUE, sep = "\t")
library(kableExtra)
kable(heatmap_table, caption = "Frequency Table of SARS-CoV-2 SNPs") %>%
  kable_styling("striped", full_width = FALSE, font_size = 12) %>%
  scroll_box(width = "100%", height = "400px")
| SNP | Bat | Pangolin | Morocco | United.States | Germany | China | Australia | Brazil | Italy | Kenya | France | Caribbean | Singapore | Vietnam | Switzerland | Ghana | Taiwan | Argentina | Saudi.Arabia | Canada | Finland |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 241 C > T | 0 | 0 | 100 | 88 | 11 | 0 | 11 | 100 | 100 | 20 | 100 | 50 | 20 | 78 | 100 | 33 | 30 | 100 | 100 | 100 | 100 |
| 3037 C > T | 67 | 0 | 83 | 88 | 11 | 0 | 11 | 100 | 100 | 60 | 100 | 50 | 20 | 78 | 90 | 33 | 30 | 100 | 100 | 100 | 100 |
| 11083 G > T | 0 | 32 | 0 | 13 | 0 | 0 | 56 | 0 | 0 | 30 | 22 | 50 | 70 | 11 | 0 | 0 | 40 | 0 | 0 | 10 | 0 |
| 14408 C > T | 0 | 0 | 100 | 88 | 11 | 0 | 11 | 100 | 100 | 60 | 100 | 50 | 20 | 78 | 90 | 33 | 20 | 90 | 90 | 100 | 100 |
| 17747 C > T | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17858 A > G | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18060 C > T | 67 | 58 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23403 A > G | 0 | 0 | 100 | 88 | 11 | 11 | 11 | 100 | 100 | 60 | 100 | 50 | 20 | 78 | 100 | 33 | 10 | 90 | 80 | 100 | 90 |
| 26144 G > T | 0 | 0 | 0 | 0 | 0 | 0 | 33 | 0 | 0 | 10 | 0 | 50 | 0 | 11 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
| 27046 C > T | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28144 T > C | 33 | 58 | 0 | 0 | 0 | 56 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 30 | 0 | 0 | 0 | 0 |
| 28881 G > A | 0 | 0 | 0 | 13 | 11 | 0 | 0 | 29 | 56 | 10 | 0 | 0 | 0 | 44 | 50 | 22 | 20 | 40 | 0 | 0 | 0 |
| 28882 G > A | 0 | 0 | 0 | 13 | 11 | 0 | 0 | 29 | 56 | 10 | 0 | 0 | 0 | 44 | 50 | 22 | 20 | 40 | 0 | 0 | 0 |
| 28883 G > C | 0 | 0 | 0 | 13 | 11 | 0 | 0 | 29 | 56 | 10 | 0 | 0 | 0 | 44 | 50 | 22 | 20 | 40 | 0 | 0 | 0 |
Before plotting the heatmap of mutation frequency, let’s look at the SNP nucleotide mutations at their positions:
As you will see below, normalization lets us visualize the clustering of sequences from bats, pangolins, and viruses isolated in the Far East.
Normalization also lets us make a comparison with the phylogenetic analysis.
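The normalization used later in this section is the transformation log(x + 1). As a small illustration, using a few frequency values that occur in the table (the vector name `freqs` is only for this sketch), the transformation keeps zeros at zero and compresses the gap between intermediate and high frequencies:

```r
# A few percentage values that occur in the frequency table.
freqs <- c(0, 11, 50, 100)

# log(x + 1) maps 0 to 0 and shrinks large values more than small ones.
log_freqs <- log(freqs + 1)   # numerically equivalent to log1p(freqs)

print(round(log_freqs, 2))
```

Because the transformation is monotonic, the ordering of frequencies is preserved; only their spacing changes, which affects the distances used for clustering.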
library("pheatmap")
library("RColorBrewer")
heatmap_table <- as.matrix(heatmap_table)
pheatmap(heatmap_table, cluster_rows = FALSE)
col.pal <- brewer.pal(9, "Blues")
pheatmap(heatmap_table, cluster_rows = FALSE, color = col.pal)
The distance matrix is calculated using the Euclidean distance between two points.
The R package pheatmap allows us to define the type of distance between two points.
drows1 <- "euclidean"
dcols1 <- "euclidean"
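To make the Euclidean distance concrete, the sketch below computes it by hand for two SNP rows and compares the result with base R's `dist()`. It uses only the Bat, Pangolin, and Morocco values of the two rows from the table; the object names are chosen for this illustration.

```r
# Two SNP rows restricted to three columns of the frequency table.
m <- rbind(
  "241 C > T"  = c(Bat = 0,  Pangolin = 0, Morocco = 100),
  "3037 C > T" = c(Bat = 67, Pangolin = 0, Morocco = 83)
)

# Euclidean distance: square root of the sum of squared differences.
manual <- sqrt(sum((m[1, ] - m[2, ])^2))

# The same quantity from stats::dist().
builtin <- as.numeric(dist(m, method = "euclidean"))

print(c(manual = manual, builtin = builtin))
```

As far as the pheatmap documentation describes it, `clustering_distance_rows` and `clustering_distance_cols` accept the methods supported by `dist()` as well as `"correlation"`, so the strings stored in `drows1` and `dcols1` select this same measure.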
Change the font size and the cell height and width.
Turn on clustering for both the columns and the rows.
hm.parameters <- list(heatmap_table,
  color = col.pal,
  cellwidth = 14, cellheight = 12, scale = "none",
  treeheight_row = 100,
  kmeans_k = NA,
  show_rownames = TRUE, show_colnames = TRUE,
  main = "Frequencies of SNP Variants of SARS-CoV-2",
  clustering_method = "average",
  # SNP mutations are in the rows
  cluster_rows = TRUE,
  # Different countries are in the columns
  cluster_cols = TRUE,
  clustering_distance_rows = drows1,
  clustering_distance_cols = dcols1,
  fontsize_row = 10,
  fontsize_col = 10)
do.call("pheatmap", hm.parameters)
Note that SNPs that frequently occur together fall in the same cluster.
SNPs at adjacent positions (28881, 28882, and 28883) have identical frequency profiles and form a single cluster.
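A minimal sketch of why this happens, using five SNP rows and three columns (United.States, Brazil, Italy) copied from the frequency table: rows with identical profiles are at distance zero, so hierarchical clustering merges them first. The object names are illustrative, and cutting the tree at three clusters is an arbitrary choice for this example.

```r
# Subset of the frequency table: 5 SNPs x 3 countries.
sub <- rbind(
  "241 C > T"   = c(United.States = 88, Brazil = 100, Italy = 100),
  "11083 G > T" = c(United.States = 13, Brazil = 0,   Italy = 0),
  "28881 G > A" = c(United.States = 13, Brazil = 29,  Italy = 56),
  "28882 G > A" = c(United.States = 13, Brazil = 29,  Italy = 56),
  "28883 G > C" = c(United.States = 13, Brazil = 29,  Italy = 56)
)

# Pairwise Euclidean distances between SNP profiles.
d <- as.matrix(dist(sub))

# Average-linkage clustering, as in the heatmap call, cut into 3 groups.
hc <- hclust(dist(sub), method = "average")
groups <- cutree(hc, k = 3)

print(round(d, 1))
print(groups)
```

The three SNPs at positions 28881 to 28883 receive the same group label, while 241 C > T and 11083 G > T each end up in a group of their own.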
The clustering part in R was based on the article by Kodali (2016).
clusters <- hclust(dist(heatmap_table))
plot(clusters)
library("pheatmap")
library("RColorBrewer")
heatmap_table <- read.table("./data/temp_file.txt", row.names = 1, header = TRUE, sep = "\t")
heatmap_table = as.matrix(heatmap_table)
# Define a normalization transformation
log_table_09_18_2020 = log (heatmap_table + 1)
# Plot heatmap using pheatmap function
pheatmap(log_table_09_18_2020)
# Choose the heatmap color palette
col.pal <- brewer.pal(9, "Blues")
# Define the distance metric between the samples (columns) and the SNPs (rows)
drows1 <- "euclidean"
dcols1 <- "euclidean"
# Plot the heatmap with the chosen options
hm.parameters <- list(log_table_09_18_2020,
  color = col.pal,
  cellwidth = 14, cellheight = 12, scale = "none",
  treeheight_row = 200,
  kmeans_k = NA,
  show_rownames = TRUE, show_colnames = TRUE,
  main = "Frequencies of SNP Variants of SARS-CoV-2",
  clustering_method = "average",
  cluster_rows = FALSE, cluster_cols = TRUE,
  clustering_distance_rows = drows1,
  clustering_distance_cols = dcols1,
  fontsize_row = 10,
  fontsize_col = 10)
do.call("pheatmap", hm.parameters)
clusters <- hclust(dist(log_table_09_18_2020))
plot(clusters)
library(stats)
plot(as.dendrogram(clusters))
Highlight the clusters as described by r-charts (n.d.).
plot(as.dendrogram(clusters))
rect.hclust(clusters, k = 2, border = 3:10)
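`rect.hclust()` only draws the groups on the plot. To obtain the cluster assignments as data, `cutree()` from the stats package can cut the same tree at the same `k`. The sketch below is an illustration on five SNP rows and three columns (Morocco, United.States, Germany) copied from the frequency table; the object names are assumptions made for this example.

```r
# Subset of the frequency table: 5 SNPs x 3 countries.
toy <- rbind(
  "241 C > T"   = c(Morocco = 100, United.States = 88, Germany = 11),
  "3037 C > T"  = c(Morocco = 83,  United.States = 88, Germany = 11),
  "14408 C > T" = c(Morocco = 100, United.States = 88, Germany = 11),
  "17747 C > T" = c(Morocco = 0,   United.States = 0,  Germany = 0),
  "17858 A > G" = c(Morocco = 0,   United.States = 0,  Germany = 0)
)

# Same normalization and clustering calls as in the section above.
log_toy <- log(toy + 1)
hc <- hclust(dist(log_toy))

# Cut the tree into k = 2 groups and list the SNPs in each group.
membership <- cutree(hc, k = 2)
print(split(names(membership), membership))
```

On this subset the widespread SNPs (241, 3037, 14408) form one group and the rare SNPs (17747, 17858) form the other.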
Some mutations or variants give the virus greater evolutionary success through increased transmissibility, which results from the interaction between the human ACE2 protein and the viral Spike protein.
Polar interactions between the SARS-CoV-2 Spike RBD protein (white) and the human ACE2 protein (blue), calculated with PyMOL using the mutagenesis tool.